Skip to content

Conversation

@moonchen
Copy link
Contributor

@moonchen moonchen commented Sep 19, 2025

Adds a watchdog thread that warns when a net thread remains in the work phase longer than a configurable duration.

Config:
proxy.config.thread_watchdog.timeout_ms (default: 1000)

Why:
Net threads should not stall; doing so adds latency to all transactions multiplexed on that thread. Stalls may indicate a misbehaving plugin, overload, or a Traffic Server bug.

On trigger, a warning is logged with the offending thread number and elapsed time to aid targeted diagnostics.

Run a watchdog thread to find blocking events.
@moonchen moonchen requested a review from Copilot September 19, 2025 20:26
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a thread watchdog system to detect blocking events in the traffic server. The watchdog monitors network threads and alerts when they remain awake (not sleeping) longer than a configurable timeout threshold, indicating potential performance issues or blocking operations.

Key changes:

  • Introduces a Watchdog::Monitor class that runs in a separate thread to monitor event loop health
  • Adds heartbeat tracking to EThread instances to record sleep/wake timestamps
  • Configures the watchdog timeout through a new configuration parameter

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/traffic_server/traffic_server.cc Creates and manages the watchdog instance, integrating it into server startup/shutdown
src/records/RecordsConfig.cc Adds configuration parameter for watchdog timeout
src/iocore/eventsystem/Watchdog.cc Implements the watchdog monitoring logic
src/iocore/eventsystem/UnixEThread.cc Adds heartbeat updates to the event loop
src/iocore/eventsystem/CMakeLists.txt Includes the new watchdog source file in the build
include/iocore/eventsystem/Watchdog.h Defines watchdog interfaces and heartbeat structure
include/iocore/eventsystem/EThread.h Adds heartbeat state to the EThread class

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@moonchen moonchen marked this pull request as draft September 19, 2025 20:27
@moonchen moonchen marked this pull request as ready for review September 19, 2025 20:28
Because ubuntu 20
macOS 14, FreeBSD 13 don't have it
@moonchen moonchen changed the title Thread watchdog watchdog: alert on ET_NET thread stalls beyond threshold Sep 22, 2025
@moonchen
Copy link
Contributor Author

This is a PR that would be nicer if we had std::jthread and stop_token support across all platforms. Ubuntu 20, FreeBSD 13, and macOS 14 are too old.

@moonchen moonchen marked this pull request as draft September 22, 2025 18:04

// Start the watchdog
int watchdog_timeout_ms = RecGetRecordInt("proxy.config.thread_watchdog.timeout_ms").value_or(1000);
watchdog = std::make_unique<Watchdog::Monitor>(eventProcessor.thread_group[ET_NET]._thread,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this optional, as in, if proxy.config.thread_watchdog == 0, we don't setup the watchdog ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@moonchen moonchen marked this pull request as ready for review October 24, 2025 18:27
@moonchen
Copy link
Contributor Author

[approve ci autest]

@moonchen moonchen requested review from Copilot and zwoop November 7, 2025 05:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2143 to +2145
watchdog = std::make_unique<Watchdog::Monitor>(eventProcessor.thread_group[ET_NET]._thread,
static_cast<size_t>(eventProcessor.thread_group[ET_NET]._count),
std::chrono::milliseconds{watchdog_timeout_ms});
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The watchdog is created immediately after eventProcessor.start() is called (line 2132), but the ET_NET threads may not have fully initialized their heartbeat state yet. Since heartbeat_state members are initialized to sentinel values (time_point::min() and seq{0}), the watchdog should either wait for threads to be ready or the initialization order should be documented to prevent potential timing issues during startup.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentinel values are chosen this way to tolerate delays in the startup of ET_NET threads. See this code in watchdog.cc:

if (last_sleep == std::chrono::steady_clock::time_point::min()) {
        // initial value sentinel - event loop hasn't started
        continue;
      }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants